Slides as R Markdown document
mSWEEP code
Slides available under CC-BY 4.0 license, code under MIT license.
22 April 2020
Slides as R Markdown document
mSWEEP code
Slides available under CC-BY 4.0 license, code under MIT license.
Culture -> Isolate -> Culture -> Sequence -> Analyse
Culture -> Sequence -> Analyse
Assume: known clusters \(c\) of bacterial strains and their reference sequences
Task: given sequencing reads \(r_n\), identify the underlying mixture of clusters and their proportions \(\theta\)
Single reads may be ambiguous and belong to multiple strains.
Assume: known clusters \(c\) of bacterial strains and their reference sequences
Task: given sequencing reads \(r_n\), identify the underlying mixture of clusters and their proportions \(\theta\)
Single reads may be ambiguous and belong to multiple strains.
Assume: known clusters \(c\) of bacterial strains and their reference sequences
Task: given sequencing reads \(r_n\), identify the underlying mixture of clusters and their proportions \(\theta\)
Single reads may be ambiguous and belong to multiple strains.
Pre-processing (one time operation):
Analysis (once for each sample):
Setup: isolate sequencing data from multiple species:
Can we identify the true sequence type with mSWEEP?
Figure 2 from Mäklin et al. 2020, Wellcome Open Research).
Setup: synthetically mix the isolate data from the previous slides together. How well do we identify the true source now?
Figure 4 from Mäklin et al. 2020, Wellcome Open Research).
As a practical example, let us look at some mixed sequencing data analyzed in Sankar et al. (Microbial genomics 2016)
We will analyse a small number of clinical sequencing samples of S. aureus that are suspected of contamination because corresponding assemblies are significantly longer than a typical S. aureus genome
To run this yourself, you will first need to install
Pre-processing (one time operation):
Analysis (once for each sample):
Download and extract the S.aureus and Staphylococcus epidermidis sequence assemblies produced in Méric et al. (Genome Biology and Evolution 2015)
wget https://ndownloader.figshare.com/files/22366911 \
-O staphylococcus_reference.tgz
tar -zxvf staphylococcus_reference.tgz
Run mlst to identify the clusters (sequence types)
mlst staphylococcus_reference/*.fna > staph_sts.tsv
Filter out sequences that can’t be assigned to a sequence type
grep -v "[[:blank:]]-[[:blank:]]" staph_sts.tsv > staph_sts_fil.tsv
Concatenate the sequences by adding a 240 bp gap between contigs, and merge the sequences to a single fasta file
gap=$(printf 'N%.0s' {1..240})
while read file; do
name=${file%.fna}
name=${name##*/}
name=">"$name
echo $name >> reference.fasta
sed "s/^>.*/$gap/g" $file | tr -d '\n' | fold >> reference.fasta
echo "" >> reference.fasta
done < <(cut -f1 staph_sts_fil.tsv)
Extract the cluster indicators
cut -f2,3 staph_sts_fil.tsv | sed 's/[[:blank:]]/-ST/g' > clusts.txt
head reference.fasta
## >1076NL ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## CATGGAACTGATGAATATTATTTATATAAAAACTTTAGAAACGTAATAGAAATAGAAAATGAAATGATATATAACAAATA ## TTCACATGGTGAAATCATTGATGATGAATTTTATGGATATCAATTAGATTTAAGTCCTGAAATTTTAGAAGCTTATCAGA ## CATTTTGTAACTTAATTATTGATAAACAAACAGTACCTAGCTTAATTTACTATGATGAAGACGTAAGAGAAGCAGGGTTA ## AGATATTGTGAAGCGGTTGAAAATGAGCTATATAGATTACCTGATAATGAAAAAGTCATTGACAATAGTATATTAAATAT ## CCATAAACTTGGATTAGTGTATCACGGAAAAAATATCTACTTTACGTCGTTACATCCTCTTTTAATAAGATATGAGATAG ## AGAAGACAATGCAACTTAAAGAAAACAAAATTAATGAAAAAGTATTTAGAAAATATAACCCTATTGGTTTATTACCTTAT
grep '>' reference.fasta
## >1076NL ## >1347N ## >19N ## >59N ## >612N1 ## >619N1 ## >619N2 ## >747N1 ## >790N ## >803NLR2 ## >93N ## >94N2B ## >966N ## >AGT24 ## >BUG37 ## >BUG46 ## >COB20 ## >CV13 ## >CV20 ## >CV28 ## >CV45 ## >DEN107 ## >DEN110 ## >DEN116 ## >DEN161 ## >DEN178 ## >DEN185 ## >DEN189 ## >DEN19 ## >DEN22 ## >DEN61 ## >DEN62 ## >DEN69 ## >DEN73 ## >DEN76 ## >GCA_000007645.1_ASM764v1_genomic ## >GCA_000009005.1_ASM900v1_genomic ## >GCA_000009585.1_ASM958v1_genomic ## >GCA_000009645.1_ASM964v1_genomic ## >GCA_000009665.1_ASM966v1_genomic ## >GCA_000010445.1_ASM1044v1_genomic ## >GCA_000010465.1_ASM1046v1_genomic ## >GCA_000011265.1_ASM1126v1_genomic ## >GCA_000011505.1_ASM1150v1_genomic ## >GCA_000011525.1_ASM1152v1_genomic ## >GCA_000011925.1_ASM1192v1_genomic ## >GCA_000012045.1_ASM1204v1_genomic ## >GCA_000013425.1_ASM1342v1_genomic ## >GCA_000013465.1_ASM1346v1_genomic ## >GCA_000016805.1_ASM1680v1_genomic ## >GCA_000017085.1_ASM1708v1_genomic ## >GCA_000017125.1_ASM1712v1_genomic ## >GCA_000024585.1_ASM2458v1_genomic ## >GCA_000025145.2_ASM2514v1_genomic ## >GCA_000027045.1_ASM2704v1_genomic ## >GCA_000144955.1_ASM14495v1_genomic ## >GCA_000145595.1_ASM14559v1_genomic ## >GCA_000146385.1_ASM14638v1_genomic ## >GCA_000149015.1_ASM14901v1_genomic ## >GCA_000153665.1_ASM15366v1_genomic ## >GCA_000159535.2_ASM15953v2_genomic ## >GCA_000159555.1_ASM15955v1_genomic ## >GCA_000159575.1_ASM15957v1_genomic ## >GCA_000160235.1_ASM16023v1_genomic ## >GCA_000160355.1_ASM16035v1_genomic ## >GCA_000160375.1_ASM16037v1_genomic ## >GCA_000160395.1_ASM16039v1_genomic ## >GCA_000160415.1_ASM16041v1_genomic ## >GCA_000162595.1_ASM16259v1_genomic ## >GCA_000162615.1_ASM16261v1_genomic ## >GCA_000162635.1_ASM16263v1_genomic ## >GCA_000162655.1_ASM16265v1_genomic ## >GCA_000162675.1_ASM16267v1_genomic ## >GCA_000162695.1_ASM16269v1_genomic ## >GCA_000162715.1_ASM16271v1_genomic ## >GCA_000162735.1_ASM16273v1_genomic ## >GCA_000162795.1_ASM16279v1_genomic ## >GCA_000162815.1_ASM16281v1_genomic ## >GCA_000162835.1_ASM16283v1_genomic ## >GCA_000162855.1_ASM16285v1_genomic ## >GCA_000163335.1_ASM16333v1_genomic ## >GCA_000163575.1_ASM16357v1_genomic ## >GCA_000164075.1_ASM16407v1_genomic ## >GCA_000164715.1_ASM16471v1_genomic ## >GCA_000171435.1_ASM17143v1_genomic ## >GCA_000171455.1_ASM17145v1_genomic ## >GCA_000172855.1_ASM17285v1_genomic ## >GCA_000174455.1_ASM17445v1_genomic ## >GCA_000174475.1_ASM17447v1_genomic ## >GCA_000174495.1_ASM17449v1_genomic ## >GCA_000174515.1_ASM17451v1_genomic ## >GCA_000174535.1_ASM17453v1_genomic ## >GCA_000174555.1_ASM17455v1_genomic ## >GCA_000174575.1_ASM17457v1_genomic ## >GCA_000174595.1_ASM17459v1_genomic ## >GCA_000174615.1_ASM17461v1_genomic ## >GCA_000174635.1_ASM17463v1_genomic ## >GCA_000174955.1_ASM17495v1_genomic ## >GCA_000175475.1_ASM17547v1_genomic ## >GCA_000175495.1_ASM17549v1_genomic ## >GCA_000175955.1_ASM17595v1_genomic ## >GCA_000176195.1_ASM17619v1_genomic ## >GCA_000177115.1_ASM17711v1_genomic ## >GCA_000177995.1_ASM17799v1_genomic ## >GCA_000178015.1_ASM17801v1_genomic ## >GCA_000178035.1_ASM17803v1_genomic ## >GCA_000180095.1_ASM18009v1_genomic ## >GCA_000180395.1_ASM18039v1_genomic ## >GCA_000187145.1_ASM18714v1_genomic ## >GCA_000187165.1_ASM18716v1_genomic ## >GCA_000189435.2_ASM18943v2_genomic ## >GCA_000189455.2_ASM18945v2_genomic ## >GCA_000193835.2_ASM19383v2_genomic ## >GCA_000193855.2_ASM19385v2_genomic ## >GCA_000193875.2_ASM19387v2_genomic ## >GCA_000204665.1_ASM20466v1_genomic ## >GCA_000205345.2_ASM20534v2_genomic ## >GCA_000205365.2_ASM20536v2_genomic ## >GCA_000205385.2_ASM20538v2_genomic ## >GCA_000205405.2_ASM20540v2_genomic ## >GCA_000205425.2_ASM20542v2_genomic ## >GCA_000208695.1_ASM20869v1_genomic ## >GCA_000210315.1_ASM21031v1_genomic ## >GCA_000212435.2_ASM21243v2_genomic ## >GCA_000215405.2_ASM21540v2_genomic ## >GCA_000215425.2_ASM21542v2_genomic ## >GCA_000221685.2_ASM22168v2_genomic ## >GCA_000221725.2_ASM22172v2_genomic ## >GCA_000221785.2_ASM22178v2_genomic ## >GCA_000221805.2_ASM22180v2_genomic ## >GCA_000221845.2_ASM22184v2_genomic ## >GCA_000237125.1_ASM23712v1_genomic ## >GCA_000237265.1_ASM23726v1_genomic ## >GCA_000239235.1_ASM23923v1_genomic ## >GCA_000239555.2_ASM23955v2_genomic ## >GCA_000239575.2_ASM23957v2_genomic ## >GCA_000239595.2_ASM23959v2_genomic ## >GCA_000239615.2_ASM23961v2_genomic ## >GCA_000239635.2_ASM23963v2_genomic ## >GCA_000239655.2_ASM23965v2_genomic ## >GCA_000239675.2_ASM23967v2_genomic ## >GCA_000242475.2_ASM24247v2_genomic ## >GCA_000242495.2_ASM24249v2_genomic ## >GCA_000242515.2_ASM24251v2_genomic ## >GCA_000242535.2_ASM24253v2_genomic ## >GCA_000242555.2_ASM24255v2_genomic ## >GCA_000242575.2_ASM24257v2_genomic ## >GCA_000245495.1_ASM24549v1_genomic ## >GCA_000245575.2_ASM24557v2_genomic ## >GCA_000245595.2_ASM24559v2_genomic ## >GCA_000245615.2_ASM24561v2_genomic ## >GCA_000245635.2_ASM24563v2_genomic ## >GCA_000245655.2_ASM24565v2_genomic ## >GCA_000245695.2_ASM24569v2_genomic ## >GCA_000247025.2_ASM24702v2_genomic ## >GCA_000247065.2_ASM24706v2_genomic ## >GCA_000247105.2_ASM24710v2_genomic ## >GCA_000247125.2_ASM24712v2_genomic ## >GCA_000247145.2_ASM24714v2_genomic ## >GCA_000247165.2_ASM24716v2_genomic ## >GCA_000247205.2_ASM24720v2_genomic ## >GCA_000247255.2_ASM24725v2_genomic ## >GCA_000247275.2_ASM24727v2_genomic ## >GCA_000247315.2_ASM24731v2_genomic ## >GCA_000247335.2_ASM24733v2_genomic ## >GCA_000247375.2_ASM24737v2_genomic ## >GCA_000247395.2_ASM24739v2_genomic ## >GCA_000247415.2_ASM24741v2_genomic ## >GCA_000248475.2_ASM24847v2_genomic ## >GCA_000248495.2_ASM24849v2_genomic ## >GCA_000248515.2_ASM24851v2_genomic ## >GCA_000248535.2_ASM24853v2_genomic ## >GCA_000248555.2_ASM24855v2_genomic ## >GCA_000248555.3_ASM24855v3_genomic ## >GCA_000248575.2_ASM24857v2_genomic ## >GCA_000248595.2_ASM24859v2_genomic ## >GCA_000248615.2_ASM24861v2_genomic ## >GCA_000248635.2_ASM24863v2_genomic ## >GCA_000248655.2_ASM24865v2_genomic ## >GCA_000248675.2_ASM24867v2_genomic ## >GCA_000248695.2_ASM24869v2_genomic ## >GCA_000248715.2_ASM24871v2_genomic ## >GCA_000248735.2_ASM24873v2_genomic ## >GCA_000248755.2_ASM24875v2_genomic ## >GCA_000248775.2_ASM24877v2_genomic ## >GCA_000248795.2_ASM24879v2_genomic ## >GCA_000248815.2_ASM24881v2_genomic ## >GCA_000248835.2_ASM24883v2_genomic ## >GCA_000248855.2_ASM24885v2_genomic ## >GCA_000248875.2_ASM24887v2_genomic ## >GCA_000248895.2_ASM24889v2_genomic ## >GCA_000248915.2_ASM24891v2_genomic ## >GCA_000248935.2_ASM24893v2_genomic ## >GCA_000248955.2_ASM24895v2_genomic ## >GCA_000248975.2_ASM24897v2_genomic ## >GCA_000248995.2_ASM24899v2_genomic ## >GCA_000249015.2_ASM24901v2_genomic ## >GCA_000249035.2_ASM24903v2_genomic ## >GCA_000252405.2_ASM25240v2_genomic ## >GCA_000253135.1_ASM25313v1_genomic ## >GCA_000257965.1_SaureusISKv1.0_genomic ## >GCA_000257985.1_SaureusISMv1.0_genomic ## >GCA_000258685.1_ASM25868v1_genomic ## >GCA_000259715.1_S_aureus_118_1_genomic ## >GCA_000259995.1_ASM25999v1_genomic ## >GCA_000260015.1_SaureusIS157v1.0_genomic ## >GCA_000260295.1_ASM26029v1_genomic ## >GCA_000260315.1_ASM26031v1_genomic ## >GCA_000262815.1_VRSA01_3_genomic ## >GCA_000262835.1_VRSA02_3_genomic ## >GCA_000262855.1_VRSA03_3_genomic ## >GCA_000262875.1_VRSA04_3_genomic ## >GCA_000262895.1_VRSA05_2_genomic ## >GCA_000262915.1_VRSA06_2_genomic ## >GCA_000262935.1_VRSA07_3_genomic ## >GCA_000262955.1_VRSA08_2_genomic ## >GCA_000262975.1_VRSA09_2_genomic ## >GCA_000262995.2_VRSA10_2_genomic ## >GCA_000263015.1_VRSA11a_2_genomic ## >GCA_000263035.1_VRSA11b_2_genomic ## >GCA_000275965.1_ASM27596v1_genomic ## >GCA_000275985.2_ASM27598v2_genomic ## >GCA_000276005.1_ASM27600v1_genomic ## >GCA_000276025.1_ASM27602v1_genomic ## >GCA_000276045.1_ASM27604v1_genomic ## >GCA_000276065.1_ASM27606v1_genomic ## >GCA_000276085.1_ASM27608v1_genomic ## >GCA_000276105.1_ASM27610v1_genomic ## >GCA_000276125.1_ASM27612v1_genomic ## >GCA_000276145.1_ASM27614v1_genomic ## >GCA_000276165.1_ASM27616v1_genomic ## >GCA_000276185.1_ASM27618v1_genomic ## >GCA_000276205.1_ASM27620v1_genomic ## >GCA_000276225.1_ASM27622v1_genomic ## >GCA_000276245.1_ASM27624v1_genomic ## >GCA_000276265.1_ASM27626v1_genomic ## >GCA_000276285.1_ASM27628v1_genomic ## >GCA_000276305.1_ASM27630v1_genomic ## >GCA_000276325.1_ASM27632v1_genomic ## >GCA_000276345.1_ASM27634v1_genomic ## >GCA_000276365.1_ASM27636v1_genomic ## >GCA_000276385.1_ASM27638v1_genomic ## >GCA_000276405.1_ASM27640v1_genomic ## >GCA_000276425.1_ASM27642v1_genomic ## >GCA_000276445.1_ASM27644v1_genomic ## >GCA_000276465.1_ASM27646v1_genomic ## >GCA_000276485.1_ASM27648v1_genomic ## >GCA_000276505.1_ASM27650v1_genomic ## >GCA_000276525.1_ASM27652v1_genomic ## >GCA_000276545.1_ASM27654v1_genomic ## >GCA_000276625.1_Newbould305_assembly_v1_genomic ## >GCA_000280745.1_S.aureus_GR1_1.0_genomic ## >GCA_000284535.1_ASM28453v1_genomic ## >GCA_000284895.1_ASM28489v1_genomic ## >GCA_000294385.1_ASM29438v1_genomic ## >GCA_000294385.2_ASM29438v2_genomic ## >GCA_000296595.1_ASM29659v1_genomic ## >GRE34 ## >GRE41 ## >HFA6096 ## >HFA6162 ## >HFA6181 ## >HFA6226 ## >HFA6228 ## >HFA6286 ## >HFA6391 ## >HUR105 ## >ICE102 ## >ICE122 ## >ICE18 ## >ICE192 ## >ICE24 ## >ICE26 ## >ICE7 ## >ICE87 ## >ICE9 ## >ICE95 ## >ICE99 ## >ITL299 ## >ITL34 ## >JAP271 ## >MCO150 ## >MEX35 ## >MEX60 ## >MEX61 ## >TAW113 ## >TAW60 ## >URU23
cat clusts.txt
## sepidermidis-ST307 ## sepidermidis-ST2 ## sepidermidis-ST788 ## sepidermidis-ST212 ## sepidermidis-ST556 ## sepidermidis-ST184 ## sepidermidis-ST184 ## sepidermidis-ST171 ## sepidermidis-ST719 ## sepidermidis-ST595 ## sepidermidis-ST329 ## sepidermidis-ST328 ## sepidermidis-ST35 ## sepidermidis-ST23 ## sepidermidis-ST19 ## sepidermidis-ST189 ## sepidermidis-ST51 ## sepidermidis-ST369 ## sepidermidis-ST72 ## sepidermidis-ST44 ## sepidermidis-ST23 ## sepidermidis-ST40 ## sepidermidis-ST66 ## sepidermidis-ST215 ## sepidermidis-ST35 ## sepidermidis-ST5 ## sepidermidis-ST21 ## sepidermidis-ST55 ## sepidermidis-ST1 ## sepidermidis-ST4 ## sepidermidis-ST22 ## sepidermidis-ST230 ## sepidermidis-ST56 ## sepidermidis-ST21 ## sepidermidis-ST14 ## sepidermidis-ST8 ## saureus-ST151 ## saureus-ST398 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## saureus-ST254 ## saureus-ST1 ## saureus-ST36 ## saureus-ST1 ## sepidermidis-ST10 ## saureus-ST250 ## saureus-ST8 ## saureus-ST8 ## saureus-ST105 ## saureus-ST8 ## saureus-ST105 ## saureus-ST5 ## saureus-ST225 ## saureus-ST239 ## saureus-ST93 ## saureus-ST239 ## saureus-ST239 ## saureus-ST1 ## saureus-ST1159 ## saureus-ST4618 ## saureus-ST72 ## sepidermidis-ST59 ## sepidermidis-ST290 ## saureus-ST30 ## saureus-ST30 ## saureus-ST30 ## saureus-ST30 ## saureus-ST30 ## saureus-ST42 ## saureus-ST145 ## saureus-ST10 ## saureus-ST30 ## saureus-ST30 ## saureus-ST30 ## saureus-ST30 ## saureus-ST431 ## saureus-ST30 ## saureus-ST30 ## saureus-ST34 ## saureus-ST30 ## saureus-ST36 ## sepidermidis-ST5 ## saureus-ST1 ## saureus-ST8 ## saureus-ST8 ## saureus-ST239 ## saureus-ST5 ## saureus-ST8 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## saureus-ST1892 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## saureus-ST8 ## saureus-ST8 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## sepidermidis-ST57 ## saureus-ST8 ## saureus-ST105 ## saureus-ST105 ## saureus-ST5 ## saureus-ST5 ## saureus-ST923 ## saureus-ST8 ## saureus-ST700 ## saureus-ST2490 ## saureus-ST30 ## saureus-ST8 ## saureus-ST5 ## saureus-ST239 ## saureus-ST8 ## saureus-ST5 ## saureus-ST25 ## sepidermidis-ST1 ## sepidermidis-ST5 ## saureus-ST239 ## saureus-ST133 ## saureus-ST8 ## saureus-ST25 ## saureus-ST22 ## sepidermidis-ST73 ## saureus-ST707 ## saureus-ST72 ## saureus-ST4166 ## saureus-ST30 ## saureus-ST59 ## saureus-ST425 ## saureus-ST80 ## saureus-ST398 ## saureus-ST109 ## saureus-ST188 ## saureus-ST8 ## saureus-ST372 ## saureus-ST395 ## saureus-ST507 ## saureus-ST45 ## saureus-ST49 ## saureus-ST30 ## saureus-ST1447 ## saureus-ST254 ## saureus-ST80 ## saureus-ST8 ## saureus-ST34 ## saureus-ST88 ## saureus-ST30 ## sepidermidis-ST2 ## sepidermidis-ST8 ## sepidermidis-ST1 ## sepidermidis-ST23 ## sepidermidis-ST22 ## sepidermidis-ST73 ## sepidermidis-ST384 ## sepidermidis-ST23 ## sepidermidis-ST136 ## sepidermidis-ST329 ## saureus-ST8 ## saureus-ST5 ## saureus-ST1356 ## saureus-ST8 ## saureus-ST22 ## saureus-ST4803 ## saureus-ST4166 ## saureus-ST5 ## saureus-ST8 ## saureus-ST5 ## saureus-ST346 ## saureus-ST36 ## saureus-ST36 ## saureus-ST36 ## saureus-ST36 ## saureus-ST5 ## saureus-ST5 ## saureus-ST1 ## saureus-ST8 ## saureus-ST5 ## saureus-ST5 ## saureus-ST36 ## saureus-ST5 ## saureus-ST36 ## saureus-ST8 ## saureus-ST664 ## saureus-ST609 ## saureus-ST45 ## saureus-ST346 ## saureus-ST36 ## saureus-ST45 ## saureus-ST634 ## saureus-ST36 ## saureus-ST8 ## saureus-ST105 ## saureus-ST1 ## saureus-ST15 ## saureus-ST398 ## saureus-ST5 ## sepidermidis-ST130 ## saureus-ST4166 ## saureus-ST398 ## saureus-ST772 ## saureus-ST239 ## saureus-ST239 ## sepidermidis-ST19 ## saureus-ST5510 ## saureus-ST371 ## saureus-ST5 ## saureus-ST5 ## saureus-ST4166 ## saureus-ST231 ## saureus-ST85 ## saureus-ST231 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## saureus-ST5 ## sepidermidis-ST89 ## sepidermidis-ST185 ## sepidermidis-ST430 ## sepidermidis-ST7 ## sepidermidis-ST5 ## sepidermidis-ST5 ## sepidermidis-ST2 ## sepidermidis-ST2 ## sepidermidis-ST2 ## sepidermidis-ST323 ## sepidermidis-ST218 ## sepidermidis-ST324 ## sepidermidis-ST325 ## sepidermidis-ST20 ## sepidermidis-ST326 ## sepidermidis-ST7 ## sepidermidis-ST327 ## sepidermidis-ST328 ## sepidermidis-ST329 ## sepidermidis-ST73 ## sepidermidis-ST170 ## sepidermidis-ST330 ## sepidermidis-ST331 ## sepidermidis-ST331 ## sepidermidis-ST332 ## sepidermidis-ST333 ## sepidermidis-ST334 ## sepidermidis-ST72 ## sepidermidis-ST184 ## sepidermidis-ST72 ## saureus-ST115 ## saureus-ST1692 ## saureus-ST22 ## saureus-ST4803 ## saureus-ST5 ## saureus-ST5 ## saureus-ST398 ## sepidermidis-ST69 ## sepidermidis-ST83 ## sepidermidis-ST184 ## sepidermidis-ST57 ## sepidermidis-ST17 ## sepidermidis-ST60 ## sepidermidis-ST22 ## sepidermidis-ST88 ## sepidermidis-ST73 ## sepidermidis-ST23 ## sepidermidis-ST21 ## sepidermidis-ST215 ## sepidermidis-ST1 ## sepidermidis-ST5 ## sepidermidis-ST38 ## sepidermidis-ST248 ## sepidermidis-ST2 ## sepidermidis-ST40 ## sepidermidis-ST6 ## sepidermidis-ST10 ## sepidermidis-ST20 ## sepidermidis-ST32 ## sepidermidis-ST66 ## sepidermidis-ST781 ## sepidermidis-ST46 ## sepidermidis-ST559 ## sepidermidis-ST61 ## sepidermidis-ST61 ## sepidermidis-ST85 ## sepidermidis-ST59 ## sepidermidis-ST86
Run the ‘build_index’ command from Themisto
mkdir reference_index
mkdir reference_index/tmp
build_index --k 31 --input-file reference.fasta --auto-colors \
--index-dir reference_index --temp-dir reference_index/tmp \
--mem-megas 4192 --n-threads 2
https://www.ebi.ac.uk/ena — the European Nucleotide Archive
Download using your browser from
…or use the provided script
chmod +x get_reads.sh ./get_reads.sh ERR033686
Run the ‘pseudoalign’ command from Themisto
pseudoalign --query-file ERR033686_1.fastq.gz --outfile \
ERR033686_1.aln --index-dir reference_index \
--temp-dir reference_index/tmp --rc \
--n-threads 2 --gzip-output --sort-output
pseudoalign --query-file ERR033686_2.fastq.gz --outfile \
ERR033686_2.aln --index-dir reference_index \
--temp-dir reference_index/tmp --rc \
--n-threads 2 --gzip-output --sort-output
Use mSWEEP to estimate the abundances of the reference clusters
mSWEEP --themisto-1 ERR033686_1.aln.gz --themisto-2 ERR033686_2.aln.gz \
--themisto-index reference_index -i clusts.txt -o ERR033686
sed -n '1,3p' ERR033686_abundances.txt sort -gk2 ERR033686_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0 ## #total_hits: 2028476 ## #c_id mean_theta ## saureus-ST5 0.600885 ## saureus-ST398 0.134503 ## saureus-ST4166 0.0923787 ## saureus-ST22 0.0644022 ## saureus-ST1159 0.0468567 ## saureus-ST25 0.0225847 ## saureus-ST4803 0.00864433 ## saureus-ST34 0.00854929 ## saureus-ST72 0.00632708 ## saureus-ST1892 0.00555749
SAMPLES="ERR033658 ERR033686 ERR038357 ERR038367 ERR038366"
for sample in $SAMPLES ; do
./get_reads.sh $sample
pseudoalign --query-file $sample""_1.fastq.gz --outfile \
$sample""_1.aln --index-dir reference_index --temp-dir \
reference_index/tmp --rc --n-threads 2 --gzip-output --sort-output
pseudoalign --query-file $sample""_2.fastq.gz --outfile \
$sample""_2.aln --index-dir reference_index --temp-dir \
reference_index/tmp --rc --n-threads 2 --gzip-output --sort-output
mSWEEP --themisto-1 $sample""_1.aln.gz --themisto-2 $sample""_2.aln.gz \
--themisto-index reference_index -i clusts.txt -o $sample
done
sed -n '1,3p' ERR033658_abundances.txt sort -gk2 ERR033658_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0 ## #total_hits: 1693855 ## #c_id mean_theta ## saureus-ST634 0.428317 ## saureus-ST5 0.230168 ## saureus-ST105 0.132684 ## saureus-ST225 0.116941 ## saureus-ST4166 0.0778451 ## saureus-ST85 0.00882182 ## saureus-ST250 0.00252122 ## saureus-ST1159 0.00251821 ## saureus-ST346 2.11807e-05 ## saureus-ST36 1.75996e-05
sed -n '1,3p' ERR033686_abundances.txt sort -gk2 ERR033686_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0 ## #total_hits: 2028476 ## #c_id mean_theta ## saureus-ST5 0.600885 ## saureus-ST398 0.134503 ## saureus-ST4166 0.0923787 ## saureus-ST22 0.0644022 ## saureus-ST1159 0.0468567 ## saureus-ST25 0.0225847 ## saureus-ST4803 0.00864433 ## saureus-ST34 0.00854929 ## saureus-ST72 0.00632708 ## saureus-ST1892 0.00555749
sed -n '1,3p' ERR038357_abundances.txt sort -gk2 ERR038357_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0 ## #total_hits: 836224 ## #c_id mean_theta ## saureus-ST22 0.999946 ## saureus-ST1 1.76348e-06 ## saureus-ST4803 1.61217e-06 ## saureus-ST10 1.53883e-06 ## saureus-ST923 1.53446e-06 ## saureus-ST72 1.2413e-06 ## saureus-ST8 1.2412e-06 ## saureus-ST772 1.17587e-06 ## saureus-ST115 1.17585e-06 ## saureus-ST254 1.13616e-06
sed -n '1,3p' ERR038367_abundances.txt sort -gk2 ERR038367_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0 ## #total_hits: 625499 ## #c_id mean_theta ## saureus-ST5 0.573422 ## saureus-ST1447 0.233878 ## saureus-ST105 0.114883 ## saureus-ST4166 0.0720786 ## saureus-ST634 0.0050561 ## saureus-ST85 0.000289054 ## saureus-ST371 6.37372e-05 ## saureus-ST231 5.41944e-05 ## saureus-ST225 5.36303e-05 ## saureus-ST45 1.29197e-05
sed -n '1,3p' ERR038366_abundances.txt sort -gk2 ERR038366_abundances.txt | grep -v "^[#]" | tail | tail -r
## #mSWEEP_version: v1.4.0 ## #total_hits: 751086 ## #c_id mean_theta ## saureus-ST1892 0.803113 ## saureus-ST45 0.188378 ## saureus-ST1159 0.00301731 ## saureus-ST254 0.00199654 ## saureus-ST1 0.00163599 ## saureus-ST115 0.000933752 ## saureus-ST88 0.000817574 ## saureus-ST188 1.67652e-05 ## saureus-ST25 1.45345e-05 ## saureus-ST395 1.01139e-05
tol15rainbow=c("#114477", "#4477AA", "#77AADD", "#117755", "#44AA88",
"#99CCBB", "#777711", "#AAAA44", "#DDDD77", "#771111",
"#AA4444", "#DD7777", "#771144", "#AA4477", "#DD77AA")
mycols <- c(tol15rainbow[1:13], "#000000", "#555555", "#AAAAAA")
ids <- c("ERR033658", "ERR033686", "ERR038357", "ERR038366", "ERR038367")
files <- paste(ids, "_abundances.txt", sep="")
clusters <- read.table(files[1], stringsAsFactors=FALSE)[, 1]
abundances <- matrix(0, nrow = length(files), ncol = length(clusters))
rownames(abundances) <- ids
colnames(abundances) <- clusters
for (i in 1:length(files)) {
abundances[i, ] <- read.table(files[i])[, 2]
}
Most analyses of interest require either sequence assemblies or isolate sequencing data.
We’ve recently developed a continuation of mSWEEP, called mGEMS, which separates the mixed sequencing data and enables these analyses.
Experiment setup: synthetic mixtures of E. coli, E. faecium, and S. aureus strains.
Results of SNP calling from synthetic mixtures vs. isolate sequencing data
Experiment: synthetic mixtures of E. coli ST131 strains. Phylogeny from isolates (left) and mixtures (right).
A tutorial for using mGEMS to reproduce the E. coli phylogenetic analyses is on GitHub
(Extra) Adapt the instructions from the mGEMS tutorial to the S. aureus data we analysed during the lecture.
If you try the methods and run into problems, you can reach me at tommi.maklin@helsinki.fi.
We’re also happy to hear any feedback on how to make our methods easier to use.